Smoking is one of the leading causes of death and a major risk factor for lung cancer. We want to determine whether it is also a risk factor for cardiovascular diseases (CVD), so we looked at the effect of smoking on systolic blood pressure. Using several visualisation techniques, we did not observe anything indicating an association between smoking and hypertension. Heavy smoking seems to be associated with lower blood pressure, but no statistical tests were performed.
The Framingham Heart Study is an observational study that aims to identify common factors that lead to cardiovascular diseases [1]. We use a modified dataset that was collected and made publicly available by the National Heart, Lung, and Blood Institute (NHLBI). Factors recorded in the study include, for example, the age, BMI, blood glucose and cholesterol levels of the patients. The first few rows of the data are shown in the table below.
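library(tidyverse)  # dplyr verbs, the %>% pipe and ggplot2
library(plotly)     # interactive plots (plot_ly)
heartData = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/frmgham.csv")
head(heartData, n = 5)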
In this report, we want to determine whether the number of cigarettes smoked per day affects blood pressure. It is well known that elevated blood pressure (hypertension) is a risk factor for cardiovascular diseases [2]. Thus, if a high number of cigarettes smoked per day is associated with hypertension, we could consider smoking as a risk factor for CVD.
The dataset provides several relevant variables, including CIGPDAY, SYSBP and DIABP, which correspond respectively to the number of cigarettes smoked each day, the systolic blood pressure in mmHg and the diastolic blood pressure, also in mmHg. An article from Harvard Health reports that higher systolic pressure has been associated with a greater risk of CVD than elevated diastolic pressure [3]. Therefore, we will study the relationship between CIGPDAY and SYSBP. The first rows of the resulting dataset are shown below, after removing any rows with missing values in either variable.
smokeBPdata = heartData %>%
  # keep only rows where both variables are observed
  filter(!is.na(CIGPDAY), !is.na(SYSBP)) %>%
  select(CIGPDAY, SYSBP)
head(smokeBPdata)
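As a quick sanity check on the filtering step, the sketch below counts the missing values in each column and confirms that none remain afterwards. It uses base R only, and the data frame `toy` and its values are invented for illustration; they are not taken from the Framingham data.

```r
# Hypothetical toy data standing in for heartData (values are invented)
toy <- data.frame(
  CIGPDAY = c(0, 20, NA, 5, 0),
  SYSBP   = c(120, NA, 135, 110.5, 150)
)

# Number of missing values in each column before filtering
colSums(is.na(toy))

# Keep only rows where both variables are observed
toy_complete <- toy[complete.cases(toy[, c("CIGPDAY", "SYSBP")]), ]

# Number of rows dropped by the filter, and a check that no NA remains
n_dropped <- nrow(toy) - nrow(toy_complete)
n_dropped
anyNA(toy_complete)
```

Running the same counts on the real data before and after filtering would show how many observations the analysis discards.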
We can start by visualising the distribution of the two variables in the plots below.
p1 = smokeBPdata %>% ggplot() +
  aes(x = CIGPDAY) +
  geom_histogram(bins = 10, color = "blue", fill = "lightblue") +
  xlab("Cigarette(s) smoked per day") + ylab("")
p2 = smokeBPdata %>% ggplot() +
  aes(x = SYSBP) +
  geom_histogram(bins = 20, color = "blue", fill = "lightblue") +
  xlab("Systolic Blood Pressure (mmHg)") + ylab("")
gridExtra::grid.arrange(p1, p2, ncol = 2,
  top = "Distribution of the variables CIGPDAY and SYSBP from the Framingham Heart Study")
From these two histograms, we can make some important observations about the data. First of all, the majority of patients are non-smokers, as the first bar of the left-hand histogram is taller than all the other bars combined. The distribution of the systolic blood pressure is closer to a normal distribution but is skewed to the right, which means it has a long tail of high values. We could imagine that the minority of patients who smoke are also those with a high systolic blood pressure, and thus that smoking is associated with hypertension. However, this is only a hypothesis; we cannot draw any statistical conclusion yet.
Let’s now look at an interactive scatter plot of the two variables to see whether there is a relationship between smoking and systolic blood pressure.
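The two observations read off the histograms can also be checked numerically. The sketch below computes the proportion of non-smokers and compares the mean with the median, since a mean above the median is a typical sign of right skew. The vectors `cig` and `sbp` are hypothetical values invented for illustration, not the study data.

```r
# Hypothetical values, invented for illustration only
cig <- c(0, 0, 0, 0, 0, 0, 5, 10, 20, 40)
sbp <- c(105, 112, 118, 121, 125, 128, 133, 140, 165, 210)

# Proportion of patients reporting zero cigarettes per day
prop_nonsmokers <- mean(cig == 0)

# For a right-skewed variable the mean typically exceeds the median
sbp_mean   <- mean(sbp)
sbp_median <- median(sbp)

prop_nonsmokers
c(mean = sbp_mean, median = sbp_median)
```

On the real data, the same two lines would be applied to `smokeBPdata$CIGPDAY` and `smokeBPdata$SYSBP`.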
plot_ly(
  data = smokeBPdata,
  x = ~CIGPDAY,
  y = ~SYSBP,
  type = "scatter",
  mode = "markers"
) %>%
  layout(
    title = "Effect of number of cigarettes smoked on blood pressure",
    xaxis = list(title = "Cigarette(s) smoked per day"),
    yaxis = list(title = "Systolic Blood Pressure (mmHg)")
  )
Looking at the graph above, it is hard to draw any conclusion about the data. Patients who do not smoke span a wide range of blood pressures, from 85.5 to 295 mmHg, and we see a similar range for patients who smoke 20 cigarettes per day. Patients who smoke more than 40 cigarettes per day seem to have lower blood pressure than other patients, with only 2 of them having a blood pressure greater than 200 mmHg. However, we cannot conclude anything from this, because the number of patients who do not smoke is far greater than, for example, the number who smoke 70 cigarettes per day (n = 3). We could use a different visualisation technique, a 2D heatmap, to represent the data and see whether it gives a better understanding of the effect of smoking on systolic blood pressure.
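The imbalance between groups, and the strength of any monotonic relationship, can be quantified alongside the scatter plot. The sketch below tabulates the number of observations at each value of CIGPDAY and computes a Spearman rank correlation. The data frame `sim` is simulated with no built-in association, so its output says nothing about the Framingham data; it only shows the calls one might use.

```r
set.seed(1)  # reproducible simulated data

# Simulated stand-in for smokeBPdata: many non-smokers, few heavy smokers
sim <- data.frame(
  CIGPDAY = sample(c(0, 10, 20, 40, 70), size = 500, replace = TRUE,
                   prob = c(0.55, 0.15, 0.20, 0.09, 0.01)),
  SYSBP   = round(rnorm(500, mean = 135, sd = 22), 1)
)

# Number of observations at each value of CIGPDAY
group_sizes <- table(sim$CIGPDAY)
group_sizes

# Rank-based (Spearman) correlation; exact = FALSE uses the asymptotic
# p-value, which avoids the warning about ties in the data
rho_test <- cor.test(sim$CIGPDAY, sim$SYSBP, method = "spearman", exact = FALSE)
rho_test$estimate
rho_test$p.value
```

A rank-based coefficient is used here as an assumption, because both variables are skewed and CIGPDAY has many tied values.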
plot_ly(data = smokeBPdata, x = ~CIGPDAY, y = ~SYSBP) %>%
  add_trace(type = "histogram2dcontour") %>%
  layout(
    title = "Heatmap of the number of cigarettes smoked against blood pressure",
    xaxis = list(title = "Cigarette(s) smoked per day"),
    yaxis = list(title = "Systolic Blood Pressure (mmHg)")
  )
Unfortunately, because the number of patients smoking zero cigarettes per day is far greater than any other group of patients, the colour scale is dominated by that group and we cannot draw any conclusion from the heatmap. To counter this problem, we can convert one of the continuous variables into a categorical one.
We add a new column to our data frame that corresponds to the smoking status. We arbitrarily define the following categories: null for 0 cigarettes a day, light for 1 to 9 cigarettes a day, medium for 10 to 20 cigarettes a day and heavy for more than one pack per day (>20 cigarettes). Once again, the first rows of the data are shown in the table below.
smokedata2 = smokeBPdata %>% mutate(
  smoking_status = case_when(
    CIGPDAY == 0 ~ "null",
    CIGPDAY > 0 & CIGPDAY < 10 ~ "light",
    CIGPDAY >= 10 & CIGPDAY <= 20 ~ "medium",
    CIGPDAY > 20 ~ "heavy"
  )
)
head(smokedata2)
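The same binning can also be expressed with base R's `cut()`, which returns an ordered factor so that the groups keep their natural order in tables and plots. This is an alternative sketch, not the approach used above; it assumes that CIGPDAY takes whole-number values, and the vector `cig` is invented for illustration.

```r
# Hypothetical cigarette counts, invented for illustration
cig <- c(0, 1, 9, 10, 20, 21, 43)

# Right-closed intervals: (-Inf, 0], (0, 9], (9, 20], (20, Inf]
# For whole numbers this matches the null / light / medium / heavy rules
status <- cut(
  cig,
  breaks = c(-Inf, 0, 9, 20, Inf),
  labels = c("null", "light", "medium", "heavy"),
  ordered_result = TRUE
)

data.frame(CIGPDAY = cig, smoking_status = status)
table(status)
```

With non-integer counts the two versions would differ for values strictly between 9 and 10, which `cut()` places in the medium group.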
Now that our data is divided into four groups, we can draw a violin plot of the distribution of each group, with a boxplot overlaid.
plot_ly(
  data = smokedata2,
  x = ~smoking_status,
  y = ~SYSBP,
  color = ~smoking_status,
  type = "violin",
  # colours are assigned to the groups in alphabetical order:
  # heavy, light, medium, null
  colors = c("firebrick", "gold", "darkorange", "green4"),
  box = list(visible = TRUE)
) %>%
  layout(
    title = "Violin representation of the smoking status against blood pressure",
    yaxis = list(title = "Systolic Blood Pressure (mmHg)"),
    xaxis = list(
      title = "Smoking category status",
      categoryorder = "array",
      categoryarray = c("null", "light", "medium", "heavy")
    )
  )
We now have a better representation of how blood pressure varies with smoking status. The shapes of the violins differ, especially at the upper adjacent values: the green violin (null) has a very different shape from the red one (heavy). It seems that smoking more than one pack of cigarettes a day is associated with lower blood pressure. When treating statistical questions, it is common to compare the means of the groups to test a null hypothesis. The boxplots show the median rather than the mean; comparing the green boxplot with the yellow, orange and red ones, the median seems to decrease slightly, but it is hard to determine whether this decrease is significant. Therefore, a high number of cigarettes smoked might be associated with lower blood pressure.
After visualising the number of cigarettes smoked per day against systolic blood pressure in different ways, we found some indication that heavy smoking might be associated with lower blood pressure. However, a statistical test is required to determine whether the slight difference observed is significant. We did not see anything indicating that smoking is associated with hypertension. Consequently, we do not have sufficient evidence to consider the number of cigarettes smoked per day as a risk factor for CVD.
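One possible way to follow up on the violin plot is to tabulate the mean and median of each group and then test whether the groups differ. The sketch below uses a Kruskal-Wallis rank sum test, chosen here as an assumption because SYSBP is right-skewed; a one-way ANOVA would be the mean-based alternative. The data frame `sim` is simulated with no built-in group difference, so its output says nothing about the Framingham data.

```r
set.seed(42)  # reproducible simulated data

# Simulated stand-in for smokedata2; group sizes and SYSBP values are invented
sim <- data.frame(
  smoking_status = factor(
    rep(c("null", "light", "medium", "heavy"), times = c(300, 60, 100, 40)),
    levels = c("null", "light", "medium", "heavy")
  ),
  SYSBP = round(rnorm(500, mean = 135, sd = 22), 1)
)

# Group size, mean and median of SYSBP for each smoking status
group_summary <- aggregate(
  SYSBP ~ smoking_status, data = sim,
  FUN = function(x) c(n = length(x), mean = mean(x), median = median(x))
)
group_summary

# Kruskal-Wallis test: do the four groups share the same distribution?
kw <- kruskal.test(SYSBP ~ smoking_status, data = sim)
kw$p.value
```

Applied to the real data, a small p-value would indicate that at least one group differs, after which pairwise comparisons could identify which one.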
[1] framinghamheartstudy.org. (n.d.). Framingham Heart Study. [online] Available at: https://framinghamheartstudy.org/fhs-about/
[2] Fuchs, F.D. and Whelton, P.K. (2020). High Blood Pressure and Cardiovascular Disease. Hypertension, 75(2), pp.285–292. Available at: https://www.ahajournals.org/doi/10.1161/HYPERTENSIONAHA.119.14240
[3] Harvard Health. (2018). Which blood pressure number is important? [online] Available at: https://www.health.harvard.edu/staying-healthy/which-blood-pressure-number-is-important#